1. Overview

The riparian plot is the primary method to visualize syntenic relationships among >2 species. GENESPACE offers this functionality and considerable customization opportunities. Here, we will illustrate some of these possibilities.

Typically, the user will run the GENESPACE pipeline via run_genespace(gsParam = x), then call the function plot_riparian(gsParam = x) where x is the GENESPACE parameter list. This method has the benefit of using the pre-checked file paths stored in x and can thus calculate and access the block coordinates without any user specifications. Since plot_riparian() requires access to the ‘syntenic hits’ text files, which can be very large, this tutorial does not include the source data to repeat the exact analyses. Instead users will have to construct the entire GENESPACE run from the scripts included below. The full run takes about 20 minutes to complete.

library(GENESPACE)
## GENESPACE v1.1.10: synteny and orthology constrained comparative
##         genomics

2. Get your GENESPACE run together

NOTE This section is copied from the primary GENESPACE guide. If you have already run that, no need to do this part.

Set the paths for your run. Change these paths to those valid on your system

genomeRepo <- "~/path/to/store/rawGenomes"
wd <- "~/path/to/genespace/workingDirectory"
path2mcscanx <- "~/path/to/MCScanX/"

Download the data of interest. Here we are going to use human, mouse, platypus, chicken, and sand lizard. To do this programatically, get the URLs to the genome annotations, but leave off the ‘genomic.gff.gz’ string at the end, then make the full paths by appending ‘translated_cds.faa.gz’ and ‘genomic.gff.gz’ to the ends of the paths. Finally make the directories to write to and download the files.

urls <- c(
  human ="000/001/405/GCF_000001405.40_GRCh38.p14/GCF_000001405.40_GRCh38.p14_",
  mouse = "000/001/635/GCF_000001635.27_GRCm39/GCF_000001635.27_GRCm39_",
  platypus = "004/115/215/GCF_004115215.2_mOrnAna1.pri.v4/GCF_004115215.2_mOrnAna1.pri.v4_",
  chicken = "016/699/485/GCF_016699485.2_bGalGal1.mat.broiler.GRCg7b/GCF_016699485.2_bGalGal1.mat.broiler.GRCg7b_",
  sandLizard = "009/819/535/GCF_009819535.1_rLacAgi1.pri/GCF_009819535.1_rLacAgi1.pri_")

genomes2run <- names(urls)
urls <- file.path("https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF", urls)
translatedCDS <- sprintf("%stranslated_cds.faa.gz", urls)
geneGff <- sprintf("%sgenomic.gff.gz", urls)

names(translatedCDS) <- genomes2run
names(geneGff) <- genomes2run
writeDirs <- file.path(genomeRepo, genomes2run)
names(writeDirs) <- genomes2run
for(i in genomes2run){
  print(i)
  if(!dir.exists(writeDirs[i]))
    dir.create(writeDirs[i])
  download.file(
    url = geneGff[i], 
    destfile = file.path(writeDirs[i], basename(geneGff[i])))
  download.file(
    url = translatedCDS[i], 
    destfile = file.path(writeDirs[i], basename(translatedCDS[i])))
}

Parse the annotations to fastas with headers that match a gene bed file

genomes2run <- c("human", "mouse", "platypus", "chicken", "sandLizard")
parsedPaths <- parse_annotations(
  rawGenomeRepo = genomeRepo,
  genomeDirs = genomes2run,
  genomeIDs = genomes2run,
  presets = "ncbi",
  genespaceWd = wd)

Conduct the GENESPACE run … depending on your machine, this can take a few minutes up to an hour.

gpar <- init_genespace(
  wd = wd,
  ploidy = 1, 
  path2mcscanx = path2mcscanx)
out <- run_genespace(gpar, overwrite = T)
## Loading objects:
##   gsParam

2. Default behavior

Here is what the default plot looks like using the human genome as the reference. Each syntenic block, phased and colored by the human reference genome chromosomes, are visualized as braids. Each chromosome is a rounded rectangle polygon.

ripDat <- plot_riparian(
  gsParam = out, 
  refGenome = "human", 
  forceRecalcBlocks = FALSE)

The forceRecalcBlocks argument (default = TRUE) ignores any existing reference-phased block coordiates and re-calculates them according to the specifications in the plot_riparian function call. In general, this is a good idea to ensure that your parameters are being respected; however, it can be much faster to not recalculate block coordinates, especially in large runs.

3. Customizing what is plotted

For a full list of all the parameters, see the end of this tutorial.

3.1 Change the type of synteny that is plotted

Instead of aggregated syntenic regions, we can plot just the fully collinear syntenic blocks using useRegions = FALSE. We can also use physical position instead of gene rank order with useOrder = FALSE.

ripDat <- plot_riparian(
  out, 
  refGenome = "human",
  useOrder = FALSE, 
  useRegions = FALSE)

3.2 Alter the genomes (and their order) and change the reference

By default, the genomes are ordered following the positions in the species tree from orthofinder. We can adjust this using the genomeIDs parameter. We can also switch up which genome is used to phase the blocks with refGenome.

ripDat <- plot_riparian(
  gsParam = out, 
  refGenome = "mouse",
  genomeIDs = c("mouse", "human", "platypus", "chicken"), 
  forceRecalcBlocks = FALSE)

3.3 Adjust how synteny is used to order chromosomes against the reference

By default, syntenyWeight = 0.5, which means that chromsomes are re-ordered across the genomes to maximize synteny while also preserving local ordering of chromosomes by the order of chromosome names. For example, if both Chr1 and 2 are at similar syntenic positions, but Chr2 is slightly more to the left than Chr1, Chr1 will still come first in the synteny plot. If syntenyWeight = 1, only synteny is considered and names are ignored. If syntenyWeight = 0 (equivalent to reorderBySynteny = FALSE), only chromosome names are used and synteny is ignored.

ripDat <- plot_riparian(
  gsParam = out, 
  #reorderBySynteny = FALSE,
  syntenyWeight = 0,
  refGenome = "human")

3.4 Manually set reference chromosome order

We can also change how the chromosomes are ordered using two approaches. Either using a custom order for the reference genome with customRefChrOrder.

ripDat <- plot_riparian(
  gsParam = out, 
  refGenome = "human",
  customRefChrOrder = c("X", 1:22))

Or alternatively, by supplying a function that re-orders the all the chromosomes. By default, refChrOrdFun orders the reference genome chromosomes so that first chromosomes with smaller numeric values are on the left, then break the ties by chromosomes that come earlier in the alphabet are to the left, and if still tied, choose a random one. But we can mix this up. For example, any chromosomes that are not numbered by an integer go first, then the integers:

ordFun <- function(x) 
  data.table::frank(
    list(!grepl("[a-zA-Z]", x), 
         as.numeric(gsub("[a-zA-Z]", "", x))), 
    ties.method = "random")
ripDat <- plot_riparian(
  gsParam = out, 
  refChrOrdFun = ordFun,
  refGenome = "human",
  reorderBySynteny = FALSE)

3.5 Include even tiny chromosomes/scaffolds

By default, plot_riparian(minChrLen2plot = ifelse(useOrder, 100, 1e5)) excludes any scaffolds that have < 100 genes (is using gene rank order) or < 100kb (if using physical position). We can plot all chromosomes by setting this to 0, or only plot larger chromosomes by increasing this value. For example, show even the tiny little scaffolds:

ripDat <- plot_riparian(
  gsParam = out,
  minChrLen2plot = 0)

3.6 Highlight inversions

In some cases (e.g. closely related genomes), it may be desirable to highlight where inverted sequences exist. We can do this with inversionColor, which is set to NULL by default. If given a color, this will highlight inverted blocks.

ripDat <- plot_riparian(
  gsParam = out, 
  genomeIDs = c("chicken", "sandLizard"), 
  refGenome = "chicken", 
  inversionColor = "green")

4. Customizing the appearence of the plot

4.1 gaps between chromosomes and appearence of chromosome polygons

Depending on the differences in genome sizes, we can also adjust how the gaps between chromosomes are treated. The two parameters that do this are gapProp, which controls the gap size between chromosomes in the largest genome, and scaleGapSize, which controls how large the gaps are for smaller genomes. Here, we double the gap size and double the amount of spread between other genomes. This way, chromosomes are more evenly distributed from top to bottom in the plot.

To improve clarity, we can also add some spacing between the ends of the braids. These normally end right at the middle of the chromosome polygons. However, we can bring them out further with scaleBraidGap, which adds a gap proportional to the height of the chromosome polygons. This probably should be used in conjunction with larger gaps via gapProp (to make the breaks more obvious). Its also sometimes better to make the chromosome polygons less rounded and more like a rectangle so the bounds are clearer when the braids don’t go under the chromosome polygons.

ripDat <- plot_riparian(
  gsParam = out, 
  gapProp = 0.02, 
  scaleBraidGap = 2,
  scaleGapSize = .5,
  howSquare = 2,
  refGenome = "human")

4.2 Invert chromosomes

In some cases, it can improve clarity by inverting certain chromosomes. For example, the mouse X chromosome is somewhat reversed in order relative to platypus and humans. Lets invert the order of that one and a few others to illustrate how this works. Inverted chromosomes get a * flag next to their names. We can also plot more scaffolds with few genes on them by decreasing minChrLen2plot, which has a default of 100 genes or 100kb depending on if you are using gene order or not. In some cases, it may be benefical to only show the chromosome IDs for certain genomes (like if all of the genomes are highly syntenic). You can choose which genomes to be labeled with labelTheseGenomes. The unlabeled chromosomes get smaller in this case to make it easier to view. We can change size of the text with chrLabFontSize.

invchr <- data.frame(
  genome = c("mouse", "mouse", "mouse", "mouse", "chicken"), 
  chr = c(4, 3, 1, "X", 5))
ripDat <- plot_riparian(
  gsParam = out, 
  minChrLen2plot = 10,
  invertTheseChrs = invchr,
  refGenome = "human",
  chrLabFontSize = 7,
  labelTheseGenomes = c("human", "mouse", "chicken"))

4.3 Customizing the labeling etc.

Genomes with low levels of synteny, like the case here can be hard to visualize. It may be helpful to adjust the coloring and transparency of the braids. This can be accomplished with palette and braidAlpha. We can also make a dark palette to put on a white background andchange the color of the chromosomes using chrFill

ggthemes <- ggplot2::theme(
  panel.background = ggplot2::element_rect(fill = "white"))
customPal <- colorRampPalette(
  c("darkorange", "skyblue", "darkblue", "purple", "darkred", "salmon"))
ripDat <- plot_riparian(
  gsParam = out, 
  palette = customPal,
  braidAlpha = .75,
  chrFill = "lightgrey",
  addThemes = ggthemes,
  refGenome = "human")

5. Highlighting specific regions of interest (ROI)

5.1 Plotting ROI including the entire background synteny map

We can also use the riparian plot to highlight specific regions of interest (e.g. Fig. 2 here). Lets repeat this analysis with the five test genomes here. To mimic figure 2, we have to add three other customizations: use human as the reference genome, order the human chromosomes as X, 1-22, and order the genomes with sandLizard at the bottom.

roi <- data.frame(
  genome = c("human", "chicken"), 
  chr = c("X", "Z"), 
  color = c("#FAAA1D", "#17B5C5"))
ripDat <- plot_riparian(
  gsParam = out, 
  highlightBed = roi,
  refGenome = "human", 
  genomeIDs = c("sandLizard", "chicken", "human", "mouse", "platypus"),
  customRefChrOrder = c("X", 1:22))

5.2 Plotting ROIs only

You can also only plot the regions of interest without the background synteny by setting backgroundColor to NULL.

ripDat <- plot_riparian(
  gsParam = out, 
  highlightBed = roi, 
  backgroundColor = NULL, 
  genomeIDs = c("sandLizard", "chicken", "human", "mouse", "platypus"),
  refGenome = "human", 
  customRefChrOrder = c("X", 1:22))

6. Parameter descriptions

plot_riparian(...) defaults to the following basic parameters:

6.1 What to plot

  • useOrder = TRUE: the synteny map and chromosome sizes reflect gene rank order (and not physical basepair position)
  • useRegions = TRUE: plot syntenic blocks aggregated by gsParam$params$blkBuffer radius. This produces less complex synteny maps, but can sometimes yield regions that appear to be overlapping when they are not.
  • genomeIDs = gsParam$genomeIDs: plot all of the genomes that were used in the GENESPACE run

6.2 Ordering chromosomes by synteny to a reference

  • reorderBySynteny = TRUE: order chromosomes in all genomes so that they maximize synteny to a reference genome
  • syntenyWeight = 0.5: either 0, .5 or 1, specifying whether synteny should be ignored (equivalent to reorderBySynteny = FALSE), partially weighted or fully weighted.
  • refGenome = NULL (see reorderBySynteny): no reference genome specified means that the first haploid (ploidy = 1) genome in the gsParam$genomeIDs vector is used
  • refChrOrdFun = frank(list(as.numeric(gsub("\\D+", "", x)), x), ties.method = "random"): order the reference genome chromosomes so that first chromosomes with smaller numeric values are on the left, then break the ties by chromosomes that come earlier in the alphabet are to the left, and if still tied, choose a random one.
  • customRefChrOrder = NULL: the reference genome chromosomes are ordered by these chromosome IDs instead of refChrOrdFun.
  • invertTheseChrs = NULL: keep gene order along a chromosome identical to that in the bed files
  • minChrLen2plot = ifelse(useOrder, 100, 1e5): since using gene order only show chromosomes that contain > 100 genes.

6.3 Fine-tune the appearence of the braids

  • braidAlpha = 0.5: 50% transparency of the synteny map
  • scaleBraidGap = 0: the braids should end exactly at the middle of the chromosome polygons.
  • palette = gs_colors: use the gs_colors R color palette, which goes from red to pink without any greens.

6.4 Alter how the chromosome polygons appear

  • gapProp = 0.005: The gap size between chromosomes of the largest genome are 0.5% of the entire genome length.
  • howSquare = 0: use a perfect circle on the ends of the chromosome polygons
  • chrExpand = 1: there should be a buffer 1x the hieght of the chromosome label text.
  • chrFill = "white: Chromosome polygons are white.
  • labelTheseGenomes = gsParam$genomeIDs: all chromosomes in all genomes are labeled
  • chrLabBorderLwd = 0.5: use lwd = 0.5 for the borders of the chromosomes.
  • inversionColor = NULL: if not NULL, color the inversions differently by this value.

6.5 Labeling and annotation parameters

  • chrLabFun = function(x) gsub("^0", "", gsub("chr|scaf|chromosome|scaffold|^lg|_", "", tolower(x))): strip some common strings off of the chromosome names
  • xlabel = sprintf("Chromosomes scaled by %s", ifelse(useOrder, "gene rank order", "physical position")): since default is to useOrder, the x-axis label is “Chromosomes scaled by gene rank order”.
  • chrLabFontSize = 5: the size of the chromosome labels.

6.6 Plotting device and overall parameters

  • pdfFile = NULL: do not write the plot to file, just print to the current graphic device.
  • forceRecalcBlocks = TRUE: should the blocks be re-calculated, even if a phasedBlk file exists?
  • scalePlotHeight = 1: scale plot height by the number of genomes
  • scalePlotWidth = 1: use 8” wide plotting window
  • addThemes = NULL: do not pass any additional custom ggplot2 themes to the plot

6.7 Parameters for overlayed regions on the genomic background

  • highlightBed = NULL do not highlight specific regions
  • backgroundColor = NULL: not used since highlightBed = NULL
  • blk = NULL: not used unless you have a specific region to plot.